(54) System and method for providing remote automatic speech recognition
services via a packet network



(57) A system and method of operating an automatic speech recognition
service using a client-server architecture is used to make ASR services
accessible at a client location remote from the location of the main
ASR engine. The present invention utilizes client-server communications
over a packet network, such as the Internet, where the ASR server
receives a grammar from the client, receives information representing
speech from the client, performs speech recognition, and returns
information based upon the recognized speech to the client.
Description

Technical Field 

This invention relates to speech recognition in gen- 
eral and, more particularly, provides a way of providing 
remotely-accessible automatic speech recognition serv- 
ices via a packet network. 

Background of the Invention 

Techniques for accomplishing automatic speech recognition (ASR) are
well known. Among known ASR techniques are those that use grammars. A
grammar is a representation of the language or phrases expected to be
used or spoken in a given context. In one sense, then, ASR grammars
typically constrain the speech recognizer to a vocabulary that is a
subset of the universe of potentially-spoken words; and grammars may
include subgrammars. An ASR grammar rule can then be used to represent
the set of "phrases" or combinations of words from one or more grammars
or subgrammars that may be expected in a given context. "Grammar" may
also refer generally to a statistical language model (where a model
represents phrases), such as those used in language understanding
systems.

Products and services that utilize some form of automatic speech
recognition (ASR) methodology have been recently introduced
commercially. For example, AT&T has developed a grammar-based ASR
engine called WATSON that enables development of complex ASR services.
Desirable attributes of complex ASR services that would utilize such
ASR technology include high accuracy in recognition; robustness to
enable recognition where speakers have differing accents or dialects,
and/or in the presence of background noise; ability to handle large
vocabularies; and natural language understanding. In order to achieve
these attributes for complex ASR services, ASR techniques and engines
typically require computer-based systems having significant processing
capability in order to achieve the desired speech recognition
capability. Processing capability as used herein refers to processor
speed, memory, disk space, as well as access to application databases.
Such requirements have restricted the development of complex ASR
services that are available at one's desktop, because the processing
requirements exceed the capabilities of most desktop systems, which are
typically based on personal computer (PC) technology.

Packet networks are general-purpose data net- 
works which are well-suited for sending stored data of 
various types, including speech or audio. The Internet, 
the largest and most renowned of the existing packet 
networks, connects over 4 million computers in some 
140 countries. The Internet's global and exponential 
growth is common knowledge today.

Typically, one accesses a packet network, such as the Internet, through
a client software program executing on a computer, such as a PC, and so
packet networks are inherently client/server oriented. One way of
accessing information over a packet network is through use of a Web
browser (such as the Netscape Navigator, available from Netscape
Communications, Inc., and the Internet Explorer, available from
Microsoft Corp.) which enables a client to interact with Web servers.
Web servers and the information available therein are typically
identified and addressed through a Uniform Resource Locator
(URL)-compatible address. URL addressing is widely used in Internet and
intranet applications and is well known to those skilled in the art (an
"intranet" is a packet network modeled in functionality based upon the
Internet and is used, e.g., by companies locally or internally).

What is desired is a way of enabling ASR services that may be made
available to users at a location, such as at their desktop, that is
remote from the system hosting the ASR engine.

Summary of the Invention 

A system and method of operating an automatic speech recognition
service using a client-server architecture is used to make ASR services
accessible at a client location remote from the location of the main
ASR engine. In accordance with the present invention, using
client-server communications over a packet network, such as the
Internet, the ASR server receives a grammar from the client, receives
information representing speech from the client, performs speech
recognition, and returns information based upon the recognized speech
to the client. Alternative embodiments of the present invention include
a variety of ways to obtain access to the desired grammar, use of
compression or feature extraction as a processing step at the ASR
client prior to transferring speech information to the ASR server,
staging a dialogue between client and server, and operating a
form-filling service.

Brief Description of the Drawings 

FIG. 1 is a diagram showing a client-server relationship for a system
providing remote ASR services in accordance with the present invention.

FIG. 2 is a diagram showing a setup process for enabling remote ASR
services in accordance with the present invention.

FIG. 3 is a diagram showing an alternative setup process for enabling
remote ASR services in accordance with the present invention.

FIG. 4 is a diagram showing a process for rule selection in accordance
with the present invention.

FIG. 5 is a diagram showing a process for enabling remote automatic
speech recognition in accordance with the present invention.

FIG. 6 is a diagram showing an alternative process for enabling remote
automatic speech recognition in accordance with the present invention.

FIG. 7 is a diagram showing another alternative process for enabling
remote automatic speech recognition in accordance with the present
invention.

Detailed Description 

The present invention is directed to a client-server based system for
providing remotely-available ASR services. In accordance with the
present invention, ASR services may be provided to a user (e.g., at the
user's desktop) over a packet network, such as the Internet, without
the need for the user to obtain computer hardware having the extensive
processing capability required for executing full ASR techniques.

A basic client-server architecture used in accordance with the present
invention is shown in FIG. 1. ASR server 100 is an ASR software engine
which executes on a system, denoted as server node 110, that can be
linked across packet network 120 (such as the Internet) to other
computers. Server node 110 may typically be a computer having
processing capability sufficient for running complex ASR-based
applications, such as the AT&T WATSON system. Packet network 120 may,
illustratively, be the Internet or an intranet.

ASR client 130 is a relatively small program (when compared to ASR
server 100) that executes on client PC 140. Client PC 140 is a
computer, such as a personal computer (PC), having sufficient
processing capability for running client applications, such as a Web
browser. Client PC 140 includes hardware, such as a microphone, and
software for the input and capture of audio sounds, such as speech.
Methods for connecting microphones to a PC and capturing audio sounds,
such as speech, at the PC are well known. Examples of speech handling
capabilities for PCs include the Speech Application Programmer
Interface (SAPI) from Microsoft and the AT&T Advanced Speech
Application Programmer Interface (ASAPI). Details of the Microsoft SAPI
are found in, e.g., a publication entitled "Speech API Developers
Guide, Windows™ 95 Edition," Vers. 1.0, Microsoft Corporation (1995),
and details of the AT&T ASAPI are provided in a publication entitled
"Advanced Speech API Developers Guide," Vers. 1.0, AT&T Corporation
(1996); each of these publications is incorporated herein by reference.
An alternative embodiment of the present invention may utilize an
interface between ASR client 130 and one or more voice channels, such
that speech input may be provided by audio sources other than a
microphone.

Client PC 140 also has the capability of communicating with other
computers over a packet network (such as the Internet). Methods for
establishing a communications link to other computers over a packet
network (such as the Internet) are well known and include, e.g., use of
a modem to dial into an Internet service provider over a telephone
line.



ASR server 100, through server node 110, and ASR client 130, through
client PC 140, may communicate with one another over packet network 120
using known methods suitable for communicating information (including
the transmission of data) over a packet network using, e.g., a standard
communications protocol such as the Transmission Control
Protocol/Internet Protocol (TCP/IP) socket. A TCP/IP socket is
analogous to a "pipe" through which information may be transmitted over
a packet network from one point to another.
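
The "pipe" analogy can be illustrated with a minimal sketch. It assumes Python's standard socket module and uses socket.socketpair() as a local stand-in for a TCP/IP connection between ASR client 130 and ASR server 100. The length-prefixed framing and the helper names send_msg, recv_exact and recv_msg are illustrative assumptions, not part of the described system.

```python
import socket
import struct


def send_msg(sock, payload: bytes) -> None:
    """Send one message, prefixed by its length as a 4-byte big-endian integer."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes; recv() may return fewer, so loop until done."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before message was complete")
        buf += chunk
    return buf


def recv_msg(sock) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


# socketpair() returns two connected endpoints in one process; a real
# deployment would connect client PC 140 to server node 110 over TCP/IP.
client_end, server_end = socket.socketpair()
send_msg(client_end, b"hello from ASR client")
received = recv_msg(server_end)
client_end.close()
server_end.close()
```

The length prefix is one common way to mark message boundaries on a byte stream; the source does not specify any particular framing.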

Establishment of a TCP/IP socket between ASR server 100 and ASR client
130 will enable the transfer of data between ASR server 100 and ASR
client 130 over packet network 120 necessary to enable remote ASR
services in accordance with the present invention. ASR client 130 also
interfaces with audio/speech input and output capabilities and
text/graphics display capabilities of client PC 140. Methods and
interfaces for handling input and output of audio and speech are well
known, and text and graphics display handling methods and interfaces
are also well known.

ASR client 130 may be set up to run in client PC 140 in several ways.
For example, ASR client 130 may be loaded onto client PC 140 from a
permanent data storage medium, such as a magnetic disk or CD-ROM. In
the alternative, ASR client 130 may be downloaded from an information
or data source locatable over a packet network, such as the Internet.
Downloading of ASR client 130 may, e.g., be accomplished once to reside
permanently in client PC 140; alternatively, ASR client 130 may be
downloaded for single or limited use purposes. ASR client 130 may be
implemented, e.g., as a small plug-in software module for another
program, such as a Web browser, that executes on client PC 140. One way
of accomplishing this is to make ASR client 130 an Active-X software
component according to the Microsoft Active-X standard. In this way,
ASR client 130 may, e.g., be loaded into client PC 140 in conjunction
with a Web browsing session as follows: a user browsing the World Wide
Web using client PC 140 enters a Web site having ASR capability; the
Web site asks the user permission to download an ASR client module into
client PC 140 in accordance with signed Active-X control; upon the
user's authorization, ASR client 130 is downloaded into client PC 140.
Similarly, ASR server 100 may be set up to run in server node 110 in
several ways. For example, ASR server 100 may be loaded onto server
node 110 from a permanent data storage medium, such as a magnetic disk
or CD-ROM, or, in the alternative, ASR server 100 may be downloaded
from an information or data source locatable over a packet network,
such as the Internet.

Further details of providing remote ASR services in accordance with the
present invention will now be described with reference to FIGS. 2-7. It
is presumed for the discussion to follow with respect to each of these
figures that the client-server relationship is as shown in FIG. 1. A
setup phase is used to prepare ASR server 100 and ASR client 130 for
performing an automatic speech recognition task as part of an ASR
application. For convenience, items shown in FIG. 1 and appearing in
other figures will be identified by the same reference numbers as in
FIG. 1.

Referring now to FIG. 2, a setup phase in a process of providing remote
ASR services will now be described. At step 201, ASR client 130
receives a request from the application to load a client grammar. The
client grammar is illustratively a data file containing information
representing the language (e.g., words and phrases) expected to be
spoken in the context of the particular ASR application. The data file
may be in a known format, such as the Standard Grammar Format (SGF)
which is part of the Microsoft SAPI.

For purposes of illustration, an ASR application for taking a pizza
order will be used in describing the present invention. An ASR service
application, such as an application for pizza-ordering, would typically
include a program that interfaces with and uses ASR client 130 as a
resource used for accomplishing the tasks of the ASR application. Such
an ASR application could reside and execute, in whole or in part, in
client PC 140.

Considering the pizza ordering example, the client grammar PIZZA would
include information representing words that one may use in ordering
pizza, such as "pizza," "pepperoni," etc. In fact, subgrammars may be
used to build an appropriate grammar. For the pizza ordering example,
subgrammars for the PIZZA grammar could include SIZE and TOPPING. The
subgrammar SIZE could consist of words used to describe the size of the
pizza desired, such as "small," "medium" and "large." The subgrammar
TOPPING may consist of words used to describe the various toppings one
may order with a pizza, e.g., "sausage," "pepperoni," "mushroom" and
the like.
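
As an illustration of how a grammar built from subgrammars constrains the recognizer's vocabulary, the following is a minimal sketch. The dictionary layout and the function name vocabulary are assumptions made for this example only; the source does not prescribe an in-memory representation.

```python
# Hypothetical in-memory form of the PIZZA grammar (layout is an assumption).
PIZZA = {
    "words": ["pizza"],
    "subgrammars": {
        "SIZE": ["small", "medium", "large"],
        "TOPPING": ["sausage", "pepperoni", "mushroom"],
    },
}


def vocabulary(grammar) -> set:
    """Collect every word the recognizer would be constrained to."""
    vocab = set(grammar["words"])
    for words in grammar["subgrammars"].values():
        vocab.update(words)
    return vocab


pizza_vocab = vocabulary(PIZZA)
```

Any word outside pizza_vocab would simply not be a candidate for recognition in this context.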

ASR client 130 may be given the desired grammar from the application
or, alternatively, ASR client 130 may choose the grammar from a
predetermined set based upon information provided by the application.
Either way, ASR client 130 then, at step 202, sends the desired grammar
file to ASR server 100 over a TCP/IP socket. A new TCP/IP socket may
have to be set up as part of establishing a new communications session
between client PC 140 and server node 110, or the TCP/IP socket may
already exist as the result of an established communications session
between client PC 140 and server node 110 that has not been terminated.
In the pizza ordering illustration, ASR client 130 would cause
transmission of a file containing the PIZZA grammar to ASR server 100
over a TCP/IP socket.

At step 203, ASR server 100 receives the client grammar sent from ASR
client 130 and, at step 204, ASR server 100 loads the transmitted
client grammar. As used herein, "loading" of the client grammar means
to have the grammar accessible for use by ASR server 100, e.g., by
storing the grammar in RAM of server node 110. At step 205, ASR server
100 returns a grammar "handle" to ASR client 130. A grammar "handle" is
a marker, such as, e.g., a pointer to memory containing the loaded
grammar, that enables ASR client 130 to easily refer to the grammar
during the rest of the communications session or application execution.
ASR client 130 receives the grammar handle from ASR server 100 at step
206 and returns the handle to the application at step 207. For the
pizza ordering example, ASR server 100 would receive and load the
transmitted PIZZA grammar file and transmit back to ASR client 130 a
handle pointing to the loaded PIZZA grammar. ASR client 130, in turn,
would receive the PIZZA handle from ASR server 100 and return the PIZZA
handle to the pizza ordering application. In this way, the application
can simply refer to the PIZZA handle when carrying out or initiating an
ASR task as part of the pizza ordering application.
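
The setup steps 201-207 can be sketched as follows. This is a simplified, in-process illustration: the network transport is omitted, the class and method names (AsrServer, AsrClient, load_grammar, grammar_for) are hypothetical, and the handle is modeled as a small integer rather than a memory pointer.

```python
import itertools


class AsrServer:
    """Stand-in for ASR server 100; transport over the socket is omitted."""

    def __init__(self):
        self._grammars = {}
        self._next_handle = itertools.count(1)

    def load_grammar(self, grammar_file: str) -> int:
        # Steps 203-204: receive the grammar and keep it accessible in memory.
        handle = next(self._next_handle)
        self._grammars[handle] = grammar_file
        # Step 205: return a handle that refers to the loaded grammar.
        return handle

    def grammar_for(self, handle: int) -> str:
        return self._grammars[handle]


class AsrClient:
    """Stand-in for ASR client 130."""

    def __init__(self, server: AsrServer):
        self._server = server

    def load_grammar(self, grammar_file: str) -> int:
        # Step 201: the application asks the client to load a grammar.
        # Step 202: the client sends the grammar file to the server.
        # Steps 206-207: the client receives the handle and returns it.
        return self._server.load_grammar(grammar_file)


# Hypothetical grammar file contents, for illustration only.
PIZZA_GRAMMAR_FILE = "SIZE = small | medium | large"

server = AsrServer()
client = AsrClient(server)
pizza_handle = client.load_grammar(PIZZA_GRAMMAR_FILE)
```

The application would keep pizza_handle and pass it back on later requests instead of resending the grammar.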

An alternative setup approach will now be described with reference to
FIG. 3. It is assumed for the remainder of the description herein that
transmission or communication of information or data between ASR server
100 and ASR client 130 takes place over an established TCP/IP socket.
At step 301, ASR client 130 receives a request from the application to
load a client grammar. Rather than send the client grammar as a data
file to ASR server 100 at step 302, however, ASR client 130 instead
sends to ASR server 100 an identifier representing a "canned" grammar;
a "canned" grammar would, e.g., be a common grammar, such as
TIME-OF-DAY or DATE, which ASR server 100 would already have stored.
Alternatively, ASR client 130 could send to ASR server 100 an IP
address, such as a URL-compatible address, where ASR server 100 could
find the desired grammar file. ASR server 100 at step 303 receives the
grammar identifier or URL grammar address from ASR client 130, locates
and loads the requested client grammar at step 304, and at step 305
returns a grammar handle to ASR client 130. Similar to the steps
described above with respect to FIG. 2, ASR client 130 receives the
grammar handle from ASR server 100 at step 306 and returns the handle
to the application at step 307. For the pizza ordering example, the
steps described above in connection with FIG. 2 would be the same,
except that ASR client 130 would send to ASR server 100 a grammar
identifier for the PIZZA grammar (if it were a "canned" grammar) or a
URL address for the location of a file containing the PIZZA grammar;
ASR server 100 would, in turn, retrieve a file for the PIZZA grammar
based upon the grammar identifier or URL address (as sent by the ASR
client) and then load the requested PIZZA grammar.
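
The server-side lookup at step 304 might be sketched as below. The grammar contents, the example URL, the function name locate_grammar, and the injected fetch callable are all assumptions for illustration; a real server would retrieve the file over the network.

```python
from urllib.parse import urlparse

# Hypothetical store of "canned" grammars already held by the server.
CANNED_GRAMMARS = {
    "TIME-OF-DAY": "hypothetical time-of-day grammar",
    "DATE": "hypothetical date grammar",
}


def locate_grammar(reference: str, fetch) -> str:
    """Resolve a canned-grammar identifier or a URL to grammar contents.

    fetch is a caller-supplied function that retrieves a URL; it is
    injected so that this sketch needs no network access.
    """
    if reference in CANNED_GRAMMARS:
        return CANNED_GRAMMARS[reference]
    if urlparse(reference).scheme in ("http", "https"):
        return fetch(reference)
    raise KeyError(f"unknown grammar reference: {reference!r}")


# Fake remote store standing in for a Web server (illustrative URL).
PIZZA_URL = "http://example.com/grammars/pizza"
REMOTE_FILES = {PIZZA_URL: "hypothetical PIZZA grammar"}

canned = locate_grammar("DATE", REMOTE_FILES.__getitem__)
remote = locate_grammar(PIZZA_URL, REMOTE_FILES.__getitem__)
```

Either path ends with the same result as FIG. 2: the server holds the grammar and can return a handle for it.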

After the grammar has been loaded and a grammar handle returned to ASR
client 130, an ASR service application needs to select a grammar rule
to be activated. FIG. 4 shows a process for grammar rule selection in
accordance with the present invention. ASR client 130 receives from the
application a request to activate a grammar rule at step 401. At step
402, ASR client 130 sends a rule activate request to ASR server 100; as
shown in FIG. 4, ASR client 130 may also at step 402 send to ASR server
100 the previously-returned grammar handle (which may enable ASR server
100 to activate the appropriate grammar rule for the particular grammar
as identified by the grammar handle). ASR server 100 at step 403
receives the rule activate request and grammar handle (if sent). At
step 404, ASR server 100 activates the requested rule and, at step 405,
returns to ASR client 130 notification that the requested rule has been
activated. ASR client 130 receives at step 406 the notification of rule
activation and notifies the application at step 407 that the rule has
been activated. Once the application receives notice of rule
activation, it may then initiate recognition of speech.

For purposes of illustrating the process shown in FIG. 4, again
consider the pizza ordering example. A rule that may be used for
recognizing a pizza order may set the desired phrase for an order to
include the subgrammars SIZE and TOPPINGS along with the word "pizza,"
and might be denoted in the following manner: {ORDER = SIZE "pizza"
"with" TOPPINGS}. With reference again to FIG. 4, ASR client 130 would
receive from the application a request to activate a pizza ordering
rule and send the ORDER rule set out above to ASR server 100 along with
the PIZZA grammar handle. ASR server 100 receives the rule activate
request along with the PIZZA grammar handle and activates the ORDER
rule, such that the recognizer would be constrained to recognizing
words from the SIZE subgrammar, the word "pizza," the word "with" and
words from the subgrammar TOPPINGS. After activating the ORDER rule,
ASR server 100 sends notification of the rule activation to ASR client
130 which, in turn, notifies the application.
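
To make the effect of the ORDER rule concrete, the following minimal sketch checks whether a word sequence fits {ORDER = SIZE "pizza" "with" TOPPINGS}. It assumes that TOPPINGS stands for one or more topping words; the function name matches_order and the set-based subgrammars are illustrative and say nothing about how an actual recognizer applies the rule.

```python
SIZE = {"small", "medium", "large"}
TOPPINGS = {"sausage", "pepperoni", "mushroom"}


def matches_order(words) -> bool:
    """Return True if words fit ORDER = SIZE "pizza" "with" TOPPINGS.

    Assumption: TOPPINGS means one or more words from the subgrammar.
    """
    if len(words) < 4:
        return False  # need a size, "pizza", "with" and at least one topping
    size, pizza, with_word, *toppings = words
    return (
        size in SIZE
        and pizza == "pizza"
        and with_word == "with"
        and all(topping in TOPPINGS for topping in toppings)
    )


accepted = matches_order("large pizza with sausage pepperoni".split())
rejected = matches_order("large pizza with anchovies".split())
```

With the rule active, only phrases of this shape are candidates, which is what keeps the recognition task small.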

Once a grammar rule has been activated, the processing of speech for
purposes of recognizing words in the grammar according to the rule can
take place. Referring to FIG. 5, at step 501 ASR client 130 receives a
request from the application to initiate a speech recognition task. At
step 502, ASR client 130 requests streaming audio from the audio input
of client PC 140. Streaming audio refers to audio being processed "on
the fly" as more audio comes in; the system does not wait for all of
the audio input (i.e., the entire speech) before sending the audio
along for digital processing; streaming audio may also refer to partial
transmission of part of the audio signal as additional audio is input.
Illustratively, a request for streaming audio may be accomplished by
making an appropriate software call to the operating system running on
client PC 140 such that streaming audio from the microphone input is
digitized by the sound processor of client PC 140. Streaming audio
digitized from the microphone input is then passed along to ASR client
130. ASR client 130 then initiates transmission of streaming digitized
audio to ASR server 100 at step 503; like the audio input from the
microphone, the digitized audio is sent to ASR server 100 "on the fly"
even while speech input continues.



At step 504, ASR server 100 performs speech recognition on the
streaming digitized audio as the audio is received from ASR client 130.
Speech recognition is performed using known recognition algorithms,
such as those employed by the AT&T WATSON speech recognition engine,
and is performed within the constraints of the selected grammar as
defined by the activated rule. At step 505, ASR server 100 returns
streaming text (i.e., partially recognized speech) as the input speech
is recognized. Thus, as ASR server 100 reaches its initial results, it
returns those results to ASR client 130 even as ASR server 100
continues to process additional streaming audio being sent by ASR
client 130. This process of returning recognized text "on the fly"
permits ASR client 130 (or the application interfacing with ASR client
130) to provide feedback to the speaker. As ASR server 100 continues to
process additional streaming input audio, it may correct the results of
the earlier speech recognition, such that the returned text may
actually update (or correct) parts of the text already returned to ASR
client 130 as part of the speech recognition task. Once all of the
streaming audio has been received from ASR client 130, ASR server 100
completes its speech recognition processing and returns a final version
of recognized text (including corrections) at step 506.

At step 507, ASR client 130 receives the recognized text from ASR
server 100 and returns the text to the application at step 508. Again,
this may be done "on the fly" as the recognized text comes in, and ASR
client 130 passes along to the application any corrections to
recognized text received from ASR server 100.

Referring to the pizza ordering example, once the ORDER rule has been
activated and the application notified, ASR client 130 will receive a
request to initiate speech recognition and will initiate streaming
audio from the microphone input. The speaker may be prompted to speak
the pizza order, and once speaking begins, ASR client 130 sends
digitized streaming audio to ASR server 100. Thus, as the speaker
states, e.g., that she wants to order a "large pizza with sausage and
pepperoni," ASR client 130 will have sent digitized streaming audio for
the first word of the order along to ASR server 100 even as the second
word is being spoken. ASR server 100 will, as the order is being
spoken, return the first word as text "large" as the rest of the order
is being spoken. Ultimately, once the speaker stops speaking, the final
recognized text for the order, "large pizza with sausage, pepperoni,"
can be returned to ASR client 130 and, hence, to the application.
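
The "on the fly" behavior of steps 503-506 can be sketched with a generator that returns the text recognized so far after each incoming chunk. The recognizer here is a deliberate stand-in (each "audio" chunk is simply the bytes of one word), and the names recognize_stream and fake_recognizer are hypothetical; real recognition and later corrections of earlier text are not modeled.

```python
def recognize_stream(chunks, recognize_chunk):
    """Yield the text recognized so far after each incoming audio chunk."""
    words = []
    for chunk in chunks:
        words.append(recognize_chunk(chunk))
        # Each yielded string supersedes the previous one, which is where
        # a real server could also revise earlier words.
        yield " ".join(words)


def fake_recognizer(chunk: bytes) -> str:
    """Stand-in recognizer: the 'audio' is just the UTF-8 bytes of a word."""
    return chunk.decode("utf-8")


audio_chunks = [b"large", b"pizza", b"with", b"sausage", b"pepperoni"]
partials = list(recognize_stream(iter(audio_chunks), fake_recognizer))
```

The first partial result is available as soon as the first chunk arrives, before the rest of the order has been spoken.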

An alternative embodiment for carrying out the speech recognition
process in accordance with the present invention is shown in FIG. 6.
Similar to the speech recognition process shown in FIG. 5, at step 601
ASR client 130 receives a request from the application to initiate a
speech recognition task and, at step 602, ASR client 130 requests
streaming audio from the audio input of client PC 140. Streaming audio
digitized from the microphone input is then passed along to ASR client
130. At step 603, ASR client 130 compresses the digitized audio "on the
fly" and then initiates transmission of streaming compressed digitized
audio to ASR server 100, while speech input continues.

At step 604, ASR server 100 decompresses the compressed audio received
from ASR client 130 before performing speech recognition on the
streaming digitized audio. As described above with reference to FIG. 5,
speech recognition is performed within the constraints of the selected
grammar as defined by the activated rule. At step 605, ASR server 100
returns streaming text (i.e., partially recognized speech) as the input
speech is recognized. Thus, ASR server 100 returns initial results to
ASR client 130 even as ASR server 100 continues to process additional
compressed streaming audio being sent by ASR client 130, and may update
or correct parts of the text already returned to ASR client 130 as part
of the speech recognition task. Once all of the streaming audio has
been received from ASR client 130, ASR server 100 completes its speech
recognition processing and returns a final version of recognized text
(including corrections) at step 606. ASR client 130 receives the
recognized text from ASR server 100 at step 607 as it comes in and
returns the text to the application at step 608.
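
A minimal sketch of compressing at the client and decompressing at the server "on the fly" is given below. It uses Python's standard zlib module, a general-purpose lossless compressor chosen here only for illustration; the source does not name a compression method, and a speech codec would be a more typical choice. The function names and the sample data are assumptions.

```python
import zlib


def compress_stream(chunks):
    """Client side (step 603): compress each chunk as it becomes available."""
    compressor = zlib.compressobj()
    for chunk in chunks:
        # Z_SYNC_FLUSH emits all data buffered so far, so each piece can
        # be transmitted immediately instead of waiting for more input.
        yield compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
    yield compressor.flush()  # final block that terminates the stream


def decompress_stream(pieces):
    """Server side (step 604): decompress pieces as they arrive."""
    decompressor = zlib.decompressobj()
    for piece in pieces:
        data = decompressor.decompress(piece)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Illustrative "digitized audio": five chunks of a repeating byte pattern.
audio_chunks = [b"\x00\x01" * 400 for _ in range(5)]
sent_pieces = list(compress_stream(audio_chunks))
restored = b"".join(decompress_stream(sent_pieces))
original_size = sum(len(chunk) for chunk in audio_chunks)
sent_size = sum(len(piece) for piece in sent_pieces)
```

Flushing after every chunk trades some compression ratio for low latency, which matches the streaming intent of the embodiment.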

Another alternative embodiment for carrying out the speech recognition
process in accordance with the present invention is shown in FIG. 7.
Similar to the speech recognition process shown in FIGS. 5 and 6, at
step 701 ASR client 130 receives a request from the application to
initiate a speech recognition task and, at step 702, ASR client 130
requests streaming audio from the audio input of client PC 140.
Streaming audio digitized from the microphone input is then passed
along to ASR client 130. At step 703, ASR client 130 processes the
digitized audio "on the fly" to extract features useful for speech
recognition processing and then initiates transmission of extracted
features to ASR server 100, while speech input continues. Extraction of
relevant features from speech involves grammar-independent processing
that is typically part of algorithms employed for speech recognition,
and may be done using methods known to those skilled in the art, such
as those based upon linear predictive coding (LPC) or Mel filter bank
processing. Feature extraction provides information obtained from
characteristics of voice signals while eliminating unnecessary
information, such as volume.
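
As one hedged illustration of LPC-based feature extraction, the sketch below computes linear prediction coefficients per frame using the autocorrelation method and the Levinson-Durbin recursion. The frame length, the prediction order, the function names and the test signal are assumptions chosen for the example; the source does not specify which LPC variant or parameters are used.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    return [
        sum(frame[n] * frame[n + k] for n in range(len(frame) - k))
        for k in range(max_lag + 1)
    ]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion.

    Returns a[1..order] for the predictor polynomial
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or fully predicted frame; leave the rest at zero
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]


def extract_features(samples, frame_len=160, order=8):
    """Split samples into non-overlapping frames; one LPC vector per frame."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [lpc_coefficients(samples[s:s + frame_len], order) for s in starts]


# A decaying exponential x[n] = 0.9**n is predicted by x[n] = 0.9 * x[n-1],
# so the first-order coefficient should be close to -0.9.
ar1 = lpc_coefficients([0.9 ** n for n in range(200)], 1)

# Deterministic illustrative signal: 480 samples give three frames of 160.
samples = [((n * 37) % 101) / 50.0 - 1.0 for n in range(480)]
features = extract_features(samples)
```

Each 160-sample frame is reduced to 8 numbers here, which illustrates why sending features can require less data than sending the digitized samples themselves.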

Upon receiving extracted features from ASR client 130, ASR server 100
at step 704 performs speech recognition on the incoming features which
are arriving "on the fly" (i.e., in a manner analogous to streaming
audio). Speech recognition is performed within the constraints of the
selected grammar as defined by the activated rule. As is the case with
the embodiments discussed above with reference to FIGS. 5 and 6, at
step 705 ASR server 100 returns streaming text (i.e., partially
recognized speech) to ASR client 130 as the input features are
recognized. ASR server 100 continues to process additional extracted
features being sent by ASR client 130, and may update or correct parts
of the text already returned to ASR client 130. ASR server 100
completes its speech recognition processing upon receipt of all of the
extracted features from ASR client 130, and returns a final version of
recognized text (including corrections) at step 706. ASR client 130
receives the recognized text from ASR server 100 at step 707 as it
comes in and returns the text to the application at step 708.

The alternative embodiments described above with respect to FIGS. 6 and
7 each provide for additional processing at the client end. For the
embodiment in FIG. 6, this entails compression of the streaming audio
(with audio decompression at the server end); for the embodiment in
FIG. 7, this includes part of the speech recognition processing in the
form of feature extraction. Using such additional processing at the
client end significantly reduces the amount of data transmitted from
ASR client 130 to ASR server 100. Thus, less data is required to
represent the speech signals being transmitted. Where feature
extraction is accomplished at the client end, such benefits are
potentially sharply increased, because extracted features (as opposed
to digitized voice signals) require less data and no features need be
sent during periods of silence. The reduction of data produces a dual
desired benefit: (1) it permits a reduction in bandwidth required to
achieve a certain level of performance, and (2) it reduces the
transmission time in sending speech data from ASR client to ASR server
through the TCP/IP socket.

While typically a grammar rule will be activated prior to the
initiation of transmission of speech information from ASR client 130 to
ASR server 100, rule activation could take place after some or all of
the speech information to be recognized has been sent from ASR client
130 to ASR server 100. In such a circumstance, ASR server 100 would not
begin speech recognition efforts until a grammar rule has been
activated. Speech sent by ASR client 130 prior to activation of a
grammar rule could be stored temporarily by ASR server 100 to be
processed by the recognizer or, alternatively, such speech could be
ignored.
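
The two options for speech that arrives before rule activation (store temporarily or ignore) can be sketched as follows. The class name RecognitionSession, its attributes, and the use of a list of (rule, chunk) pairs in place of actual recognition are assumptions made for illustration.

```python
class RecognitionSession:
    """Server-side session that defers recognition until a rule is active."""

    def __init__(self, buffer_early_audio: bool = True):
        self.active_rule = None
        self.buffer_early_audio = buffer_early_audio
        self._pending = []   # speech received before rule activation
        self.processed = []  # (rule, chunk) pairs handed to the recognizer

    def receive_audio(self, chunk: bytes) -> None:
        if self.active_rule is None:
            # No rule yet: either hold the speech for later or drop it.
            if self.buffer_early_audio:
                self._pending.append(chunk)
            return
        self.processed.append((self.active_rule, chunk))

    def activate_rule(self, rule: str) -> None:
        self.active_rule = rule
        # Speech stored before activation is now passed to the recognizer.
        for chunk in self._pending:
            self.processed.append((rule, chunk))
        self._pending.clear()


buffering = RecognitionSession(buffer_early_audio=True)
buffering.receive_audio(b"early")
buffering.activate_rule("ORDER")
buffering.receive_audio(b"late")

discarding = RecognitionSession(buffer_early_audio=False)
discarding.receive_audio(b"early")
discarding.activate_rule("ORDER")
discarding.receive_audio(b"late")
```

Which policy is preferable would depend on the application; buffering preserves the speaker's first words at the cost of server memory.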

Further, multiple speech recognition tasks may be executed using the
techniques of the present invention. For example, an ASR application
could request ASR client 130 to instruct ASR server 100 to load a
canned grammar for a telephone number (i.e., "PHONE NUMBER") and then
request activation of a rule covering spoken numbers. After a phone
number is spoken and recognized in accordance with the present
invention (e.g., in response to a prompt to speak the phone number, ASR
client 130 sends digitized spoken numbers to ASR server 100 for
recognition), the ASR application could then request ASR client 130 to
set up and initiate recognition of pizza ordering speech (e.g., load
PIZZA grammar, activate ORDER rule, and initiate speech recognition) in
accordance with the examples described above with reference to FIGS.
2-5.

In addition to the simple pizza ordering example
used above for illustration, a wide array of potential ASR
services may be provided over a packet network in
accordance with the present invention. One example of
an ASR application enabled by the present invention is
a form-filling service for completing a form in response
to spoken responses to information requested for each
of a number of blanks in the form. In accordance with
the present invention, a form-filling service may be
implemented wherein ASR client 130 sends grammars
representing the possible choices for each of the blanks
to ASR server 100. For each blank, ASR client 130
requests activation of the appropriate grammar rule and
sends a corresponding spoken answer made in
response to a request for information needed to complete
the blank. ASR server 100 applies an appropriate
speech recognition algorithm in accordance with the
selected grammar and rule, and returns text to be
inserted in the form.
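
A form-filling loop of the kind just described could look roughly like the Python sketch below. The form fields, the choice lists, and both function names are hypothetical, and plain text strings stand in for spoken answers; the "recognizer" simply checks whether the answer is covered by the grammar for the current blank.

```python
from typing import Callable, Dict, List

# Hypothetical form: each blank maps to the choices its grammar allows.
FORM_GRAMMARS: Dict[str, List[str]] = {
    "size": ["small", "medium", "large"],
    "topping": ["pepperoni", "mushroom", "sausage"],
}


def recognize_with_grammar(choices: List[str], utterance: str) -> str:
    """Toy recognizer: accept the utterance only if the grammar contains it."""
    word = utterance.strip().lower()
    if word not in choices:
        raise ValueError(f"{utterance!r} is not covered by the active grammar")
    return word


def fill_form(answers: Callable[[str], str]) -> Dict[str, str]:
    """For each blank, activate its grammar, take the answer, store the text."""
    form: Dict[str, str] = {}
    for blank, choices in FORM_GRAMMARS.items():
        spoken = answers(blank)              # text standing in for speech
        form[blank] = recognize_with_grammar(choices, spoken)
    return form


filled = fill_form({"size": "Large", "topping": "mushroom"}.__getitem__)
```

Constraining each blank to its own grammar keeps the vocabulary small per step, which is the point of sending one grammar per blank.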

Other ASR services may involve an exchange of
information (e.g., a dialogue) between server and client.
For example, an ASR service application for handling
flight reservations may, in accordance with the present
invention as described herein, utilize a dialogue
between ASR server 100 and ASR client 130 to accomplish
the ASR task. A dialogue may proceed as follows:

Speaker (through ASR client 130 to ASR server 100):
    "I want a flight to Los Angeles."

ASR server's response to ASR client (in the form of
text or, alternatively, speech returned by ASR
server 100 to ASR client 130):
    "From what city will you be leaving?"

Speaker (through ASR client to ASR server):
    "Washington, DC."

ASR server's response to ASR client:
    "What day do you want to leave?"

Speaker (ASR client to ASR server):
    "Tuesday."

ASR server's response to ASR client:
    "What time do you want to leave?"

Speaker (ASR client to ASR server):
    "4 O'clock in the afternoon."

ASR server's response to ASR client:
    "I can book you on XYZ Airline flight 4567
    from Washington, DC to Los Angeles on Tuesday at
    4 O'clock PM. Do you want to reserve a seat on this
    flight?"

In this case, the information received from ASR
server 110 is not literally the text from the recognized
speech, but is information based upon the recognized
speech (which would depend upon the application).
Each leg of the dialogue may be accomplished in
accordance with the ASR client-server method
described above. As may be observed from this example,
such an ASR service application requires of the
ASR client and ASR server not only the ability to handle
natural language, but also access to a large database
that is constantly changing. To accomplish this, it may
be desirable to have the ASR service application actually
installed and executing in server node 110, rather
than in client PC 140. Client PC 140 would, in that case,
merely have to run a relatively small "agent" program
that, at the control of the application program running at
server node 110, initiates ASR client 130 and shepherds
the speech input through ASR client 130 along to
ASR server 100. An example of such an "agent" program
may be, e.g., one that places a "talking head" on
the screen of client PC 140 to assist the interaction
between an individual using the ASR service application
at client PC 140 and, through ASR client 130 and
ASR server 100, sends the person's speech information
along to ASR server 100 for recognition.
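
The flight-reservation dialogue above, in which the server returns information based upon the recognized speech rather than the recognized text itself, can be sketched in Python as a simple slot-filling loop. This is only an illustration under assumptions: the class name, the slot names, and the format of the final offer are hypothetical, already-recognized text stands in for speech, and no flight database is consulted.

```python
from typing import Dict, List, Optional, Tuple

# Prompts follow the dialogue in the text; the slot names are hypothetical.
QUESTIONS: List[Tuple[str, str]] = [
    ("origin", "From what city will you be leaving?"),
    ("day", "What day do you want to leave?"),
    ("time", "What time do you want to leave?"),
]


class ReservationDialogue:
    """Application-side state for one reservation dialogue."""

    def __init__(self, destination: str) -> None:
        # The destination comes from the speaker's first utterance.
        self.slots: Dict[str, str] = {"destination": destination}
        self.asked: Optional[str] = None     # slot the last prompt asked about

    def respond(self, recognized_text: Optional[str] = None) -> str:
        # Store the newly recognized answer in the slot that was asked about.
        if self.asked is not None and recognized_text is not None:
            self.slots[self.asked] = recognized_text
        # Ask for the first slot that is still empty.
        for slot, prompt in QUESTIONS:
            if slot not in self.slots:
                self.asked = slot
                return prompt
        self.asked = None
        return "Offer: {origin} to {destination}, {day} at {time}".format(**self.slots)


dialogue = ReservationDialogue("Los Angeles")
replies = [dialogue.respond()]
for answer in ["Washington, DC", "Tuesday", "4 PM"]:
    replies.append(dialogue.respond(answer))
```

Each call to `respond` corresponds to one leg of the dialogue; a real application would replace the final offer string with a lookup against the reservation database mentioned in the text.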

In summary, the present invention provides a way of 
providing ASR services that may be made available to 
users over a packet network such as the Internet, at a
location remote from a system hosting an ASR engine 
using a client-server architecture. 

What has been described is merely illustrative of
the application of the principles of the present invention.
Other arrangements and methods can be implemented
by those skilled in the art without departing from the
spirit and scope of the present invention.

Where technical features mentioned in any claim
are followed by reference signs, those reference signs
have been included for the sole purpose of increasing
the intelligibility of the claims and accordingly, such
reference signs do not have any limiting effect on the
scope of each element identified by way of example by
such reference signs.

Claims 

1. A method of operating an automatic speech recognition
service accessible by a client over a packet
network, comprising the steps of:

a. receiving from the client over the packet network
information corresponding to a grammar
used for speech recognition;

b. receiving from the client over the packet network
information representing speech;

c. recognizing the received speech information
by applying an automatic speech recognition
algorithm in accordance with the grammar; and

d. sending information based upon the recognized
speech over the packet network to the client.

2. The invention according to claims 1 or 28, further
comprising the step of, if the information corresponding
to a grammar is an address corresponding
to the location of a grammar, obtaining access
to a grammar located at the corresponding grammar
address.

3. The invention according to claims 2, 13 or 21,
wherein the address corresponding to the location
of a grammar is a uniform resource locator-compatible
address.

4. The invention according to claims 1, 12, 20 or 28,
wherein the information representing speech
arrives from the client in a streaming manner; or

wherein the information representing speech
received from the client comprises digitized
speech; or

wherein the information representing speech
received from the client comprises compressed
digitized speech; or

wherein the information representing speech
received from the client comprises features
extracted by the client from digitized speech.

5. The invention according to claim 1, wherein the
step of recognizing the received speech information
is repeated as new speech information is received
from the client.

6. The invention according to claims 1, 9, 12, 17, 20
or 25, wherein the information based upon the recognized
speech comprises text information; or

wherein the information based upon the recognized
speech comprises additional speech.

7. The invention according to claim 1, wherein the
step of sending information based upon the recognized
speech is repeated as additional speech
information is recognized.

8. The invention according to claim 7, further comprising
the step of sending to the client revised information
based upon recognized speech previously sent
to the client.

9. The invention according to claim 1, wherein the
steps of b, c and d are repeated to create an
exchange of information between client and server.

10. The invention according to claims 1 or 28, further
comprising the step of activating a grammar rule in
response to a request received from the client over
the packet network.

11. The invention according to claims 1 or 28, further
comprising the step of sending over the packet network
to the client a handle corresponding to the
grammar.

12. A system for operating an automatic speech recognition
service accessible by a client over a packet
network, comprising:

a. a programmable processor;

b. memory;

c. an audio input device; and

d. a communications interface for establishing
a communications link with the client over the
packet network;

wherein said processor is programmed
to execute the steps of:

i. receiving from the client over the packet
network information corresponding to a
grammar used for speech recognition;

ii. receiving from the client over the packet
network information representing speech;

iii. recognizing the received speech information
by applying an automatic speech
recognition algorithm in accordance with
the grammar; and

iv. sending information based upon the
recognized speech over the packet network
to the client.

13. The invention according to claim 12, wherein the
processor is further programmed to execute the
step of, if the information corresponding to a grammar
is an address corresponding to the location of
a grammar, obtaining access to a grammar located
at the corresponding grammar address.

14. The invention according to claim 12, wherein the
processor is further programmed to repeat the step
of recognizing the received speech information as
new speech information is received from the client.

15. The invention according to claim 12, wherein the
processor is further programmed to repeat the step
of sending information based upon the recognized
speech as additional speech information is recognized.

16. The invention according to claim 15, wherein the
processor is further programmed to execute the
step of sending to the client revised information
based upon recognized speech previously sent to
the client.

17. The invention according to claim 12, wherein the
processor is further programmed to repeat the 
steps of b, c and d to create an exchange of infor- 
mation between client and server. 

18. The invention according to claim 12, wherein the
processor is further programmed to execute the
step of activating a grammar rule in response to a
request received from the client over the packet
network.

19. The invention according to claim 12, wherein the
processor is further programmed to execute the
step of sending over the packet network to the client
a handle corresponding to the grammar.

20. An article of manufacture, comprising a computer-readable
medium having stored thereon instructions
for operating an automatic speech recognition
service accessible by a client over a packet
network, said instructions which, when performed by a
processor, cause the processor to execute a series
of steps comprising:

a. receiving from the client over the packet network
information corresponding to a grammar
used for speech recognition;

b. receiving from the client over the packet network
information representing speech;

c. recognizing the received speech information
by applying an automatic speech recognition
algorithm in accordance with the grammar; and

d. sending information based upon the recognized
speech over the packet network to the client.

21. The invention according to claim 20, wherein the
instructions, when performed by a processor, further
cause the processor to execute the step of, if
the information corresponding to a grammar is an
address corresponding to the location of a grammar,
obtaining access to a grammar located at the
corresponding grammar address.

22. The invention according to claim 20, wherein the
instructions, when performed by a processor, further
cause the processor to repeat the step of recognizing
the received speech information as new
speech information is received from the client.



23. The invention according to claim 20, wherein the
instructions, when performed by a processor, further
cause the processor to repeat the step of sending
information based upon the recognized speech
as additional speech information is recognized.

24. The invention according to claim 23, wherein the
instructions, when performed by a processor, further
cause the processor to execute the step of
sending to the client revised information based
upon recognized speech previously sent to the client.

25. The invention according to claim 20, wherein the
instructions, when performed by a processor, further
cause the processor to repeat the steps of b, c
and d to create an exchange of information
between client and server.

26. The invention according to claim 20, wherein the
instructions, when performed by a processor, further
cause the processor to execute the step of activating
a grammar rule in response to a request
received from the client over the packet network.

27. The invention according to claim 20, wherein the
instructions, when performed by a processor, further
cause the processor to execute the step of
sending over the packet network to the client a handle
corresponding to the grammar.

28. A method of operating an automatic form filling
service accessible by a client over a packet network,
comprising the steps of:

a. receiving from the client over the packet network
information corresponding to a grammar
used for speech recognition, wherein said
grammar corresponds to words associated
with text information to be inserted in the form;

b. receiving from the client over the packet network
information representing speech;

c. recognizing the received speech information
by applying an automatic speech recognition
algorithm in accordance with the grammar; and

d. sending text corresponding to the recognized
speech over the packet network to the client
for insertion in the form.